| Locality | Median Vaccination Rate (%) | IQR |
|---|---|---|
| Metro | 57.9 | 47.7-66.0 |
| Non-metro | 49.5 | 42.3-57.5 |
Figure 1. Scatterplot and linear regression overlay with median household income as the predictor and percent of county vaccinated as the outcome
Figure 2. Scatterplot and linear regression overlay with unemployment rate as the predictor and percent of county vaccinated as the outcome
Figure 3. Scatterplot and linear regression overlay with percent poverty as the predictor and percent of county vaccinated as the outcome
Median income and percent poverty are highly correlated (r = -0.77), and percent with a bachelor's degree is strongly correlated with median income (r = 0.62). We therefore removed median income, due to its strong correlation with both our main predictor and percent poverty.
| | pct_vax | pct_bachelors | unemployment | median_income | pct_poverty |
|---|---|---|---|---|---|
| pct_vax | 1.00 | 0.46 | 0.19 | 0.40 | -0.25 |
| pct_bachelors | 0.46 | 1.00 | -0.04 | 0.62 | -0.35 |
| unemployment | 0.19 | -0.04 | 1.00 | -0.14 | 0.31 |
| median_income | 0.40 | 0.62 | -0.14 | 1.00 | -0.77 |
| pct_poverty | -0.25 | -0.35 | 0.31 | -0.77 | 1.00 |
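The screening step above can be sketched as follows. This is a minimal Python/pandas illustration on synthetic data (the column names match the report, but the values and the 0.6 threshold are assumptions for demonstration; the original analysis may have used different tooling):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the county-level data (illustrative only):
# poverty and bachelor's are constructed to correlate with income.
median_income = rng.normal(55, 10, n)
pct_poverty = 30 - 0.4 * median_income + rng.normal(0, 2, n)
pct_bachelors = 0.5 * median_income + rng.normal(0, 5, n)
unemployment = rng.normal(5, 1.5, n)
pct_vax = 40 + 0.8 * pct_bachelors + rng.normal(0, 10, n)

df = pd.DataFrame({
    "pct_vax": pct_vax, "pct_bachelors": pct_bachelors,
    "unemployment": unemployment, "median_income": median_income,
    "pct_poverty": pct_poverty,
})

# Pairwise Pearson correlations, as in the table above
corr = df.corr()

# Flag predictors whose |r| with median_income exceeds an assumed 0.6 cutoff
collinear = corr["median_income"].drop("median_income").abs() > 0.6
print(corr.round(2))
print("Drop median_income?", collinear.any())
```

With data built this way, both `pct_poverty` and `pct_bachelors` cross the cutoff, which is the pattern that motivated dropping median income in the analysis.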
Before running the LASSO and decision tree models, we created a resampling object for the training data using 5-fold cross-validation repeated 5 times. We then ran a LASSO model, training 30 penalized linear regressions across a grid of penalty values to select the best fit. Figure 4 visualizes the validation-set metrics by plotting RMSE against the range of penalty values. Model performance is generally better at smaller penalty values, suggesting that most of the predictors are important to the model. RMSE also rises steeply toward the largest penalty values, because a sufficiently large penalty removes all predictors from the model and predictive accuracy drops. After tuning, all predictors remained in the model. The best-performing LASSO model had a penalty of 0.0452 and an RMSE of 12.2, indicating that it outperformed the null model. The optimal penalty can be seen where all of the coefficient paths meet in the right-hand panel of the figure.
Figure 4. Lasso Tuning Plots and Model Diagnostics
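The tuning procedure described above can be sketched in Python with scikit-learn (a hedged illustration on synthetic data; the grid bounds and pipeline details are assumptions, and the original analysis may have used different software):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the training set
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

# 5-fold cross-validation, repeated 5 times, as in the report
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)

# 30 penalty values on a log scale, mirroring the 30 models described above
penalties = np.logspace(-4, 1, 30)

pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
search = GridSearchCV(
    pipe, {"lasso__alpha": penalties},
    scoring="neg_root_mean_squared_error", cv=cv,
)
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
best_rmse = -search.best_score_  # undo sklearn's sign convention
print(f"best penalty = {best_alpha:.4g}, CV RMSE = {best_rmse:.2f}")
```

Plotting `-search.cv_results_["mean_test_score"]` against `penalties` reproduces the RMSE-vs-penalty curve shown in figure 4.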
We then ran a decision tree model using the same 5-fold cross-validation and tuned its hyperparameters to improve performance. The best-performing decision tree had a cost complexity of 0.000562, a tree depth of 8, and an RMSE of 12.7, indicating that it outperformed the null model but not the LASSO. We also used the decision tree to estimate variable importance, shown in figure 5: our main predictor, percent with a bachelor's degree, appears to be the most important variable, while locality is the least important.
Figure 5. Decision Tree Plot of Important Variables and Model Diagnostics
We first calculated the RMSE for the null model, which was 14.5. We then fit a full model with all predictors and plotted diagnostics to compare its fit with the null model. The RMSE for the full model was 12.1, indicating that it reduced RMSE relative to the null model. We repeated these steps for a model with only the main predictor, which had an RMSE of 12.8: better than the null model, but not quite as good as the full model. Below are the diagnostic plots for the full model and the simple model.
Figure 6. Model Diagnostics for Full Model and Simple Model on Train Data
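The null/full/simple comparison can be sketched as follows, using synthetic data and treating column 0 as a hypothetical stand-in for the main predictor (an assumption for illustration only):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data; column 0 plays the role of the main predictor
X, y = make_regression(n_samples=600, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

def fit_rmse(cols):
    """Fit OLS on the given feature columns, return held-out RMSE."""
    m = LinearRegression().fit(X_tr[:, cols], y_tr)
    return np.sqrt(mean_squared_error(y_te, m.predict(X_te[:, cols])))

# Null model: predict the training mean for every observation
null_pred = DummyRegressor().fit(X_tr, y_tr).predict(X_te)
null_rmse = np.sqrt(mean_squared_error(y_te, null_pred))

full_rmse = fit_rmse(list(range(5)))  # all predictors
simple_rmse = fit_rmse([0])           # main predictor only

print(f"null={null_rmse:.1f}  full={full_rmse:.1f}  simple={simple_rmse:.1f}")
```

The same ordering as in the report (full < simple < null) is what one would expect whenever the extra predictors carry real signal.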
We also fit univariate models for each of the other predictors, plotted diagnostics, and calculated the RMSE for each. The univariate models with unemployment (RMSE = 14.2), poverty (RMSE = 14.0), and locality (RMSE = 14.2) all came close to the null model's RMSE of 14.5, suggesting that none of these predictors adds much to the model on its own.
Figure 7. Model Diagnostics for Simple Models (Unemployment and Poverty) on Train Data
Figure 8. Predicted vs. observed values from the final model with only percent bachelor's degree on the test data
| Variable | Missing | Percent |
|---|---|---|
| FIPS | 0 | 0.00 |
| pct_vax | 56 | 1.78 |
| locality | 1 | 0.03 |
| county | 0 | 0.00 |
| state | 0 | 0.00 |
| pct_bachelors | 8 | 0.25 |
| unemployment | 0 | 0.00 |
| median_income | 0 | 0.00 |
| pct_poverty | 0 | 0.00 |
Note: this missingness summary was computed on the full dataset before we implemented the ML methods, not on the training data alone.
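A missingness table of this shape is straightforward to produce with pandas; a minimal sketch on a small illustrative frame (the values are made up, not the report's data):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with some missing values
df = pd.DataFrame({
    "FIPS": ["01001", "01003", "01005", "01007"],
    "pct_vax": [57.9, np.nan, 49.5, 62.1],
    "locality": ["Metro", "Non-metro", None, "Metro"],
})

# Count and percent of missing values per column
summary = pd.DataFrame({
    "Missing": df.isna().sum(),
    "Percent": (df.isna().mean() * 100).round(2),
})
print(summary)
```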
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 38.495765 | 0.520051 | 74.02306 | 0 |
| pct_bachelors | 1.007582 | 0.034619 | 29.10487 | 0 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.21598 | 0.215725 | 12.6557 | 847.0936 | 0 | 1 | -12174.83 | 24355.66 | 24373.76 | 492512.9 | 3075 | 3077 |